begin
    using PlutoUI
    using StatsPlots
    using LinearAlgebra
    using Distributions
    using ForwardDiff
    using MCMCChains
    using DataFrames
    using Random
    using LaTeXStrings
    using HypertextLiteral
    using Logging; Logging.disable_logging(Logging.Info);
end;
TableOfContents()

Bayesian computation with MCMC

Bayesian computation is hard

At first glance, it might be hard to see why Bayesian computation is difficult. After all, Bayes' theorem provides a very straightforward recipe to compute the posterior:

$$\text{posterior}=\frac{\text{prior} \times \text{likelihood}}{\text{evidence}}\;\;\text{or}\;\;p(\theta|\mathcal D) =\frac{p(\theta) p(\mathcal D|\theta)}{p(\mathcal D)},$$

where

$$p(\mathcal D) = \int p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta,$$

known as evidence or marginal likelihood, is a constant with respect to the model parameter $\theta$.

The recipe is easy to write down, but the daunting part lies in computing the normalising constant $p(\mathcal D)$. The integration is often high-dimensional and therefore usually intractable.

As a result, posterior probability calculations, such as the probability that $\theta \in A$,

$$\mathbb{P}(\theta \in A|\mathcal D) = \int_{\theta \in A} p(\theta |\mathcal D) \mathrm{d}\theta = \frac{ \int_{\theta \in A} p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)},$$

cannot be evaluated exactly. For example, in the coin-tossing example introduced last time, we may consider any coin with a bias of $0.5 \pm 0.05$ fair; the posterior probability we want to know is

$$\mathbb{P}(0.45 \leq \theta \leq 0.55|\mathcal D).$$

We sidestepped the integration problem last time by discretising the parameter space $\Theta = [0,1]$ into a finite set of choices. The method essentially replaces a difficult integration with a brute-force enumeration summation

$$\mathbb{P}(0.45 \leq \theta \leq 0.55|\mathcal D) = \frac{\int_{0.45}^{0.55} p(\theta)p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)}\approx \frac{\sum_{\theta_0'\in \{0.5\}} p(\theta=\theta_0')p(\mathcal D|\theta=\theta_0')}{\sum_{\theta_0\in \{0, 0.1, \ldots, 1.0\}} p(\theta=\theta_0)p(\mathcal D|\theta=\theta_0)}.$$
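The brute-force enumeration above can be sketched in a few lines of Julia. The data (7 heads out of 10 tosses), the uniform prior, and the 11-point grid below are illustrative assumptions, not values fixed by the text:

```julia
# Brute-force discretisation of the coin-tossing posterior.
using Distributions

θ_grid = 0.0:0.1:1.0                                  # discretised parameter space
prior  = fill(1 / length(θ_grid), length(θ_grid))     # uniform prior p(θ)
lik    = [pdf(Binomial(10, θ), 7) for θ in θ_grid]    # likelihood p(D|θ): 7 heads in 10
post   = prior .* lik ./ sum(prior .* lik)            # normalise by the enumerated evidence

# P(0.45 ≤ θ ≤ 0.55 | D): only θ = 0.5 of the grid falls inside the interval
prob_fair = sum(post[0.45 .<= θ_grid .<= 0.55])
```

The denominator `sum(prior .* lik)` is exactly the enumerated evidence from the equation above.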

Unfortunately, this brute-force discretisation method is not scalable: as the dimension of the parameter space grows, the algorithm becomes too slow to use. To see why, consider discretising a $D$-dimensional parameter space $\Theta \subseteq \mathbb{R}^D$. Even if each parameter is discretised with only 2 choices (a very crude discretisation), the total discretised space is of size $2^D$, which grows exponentially with $D$ and soon becomes too large for any modern computer to handle.

What's worse, the difficulty does not end here. Even if we knew the normalising constant, i.e. $p(\mathcal D)$ were known and the posterior could be evaluated exactly, we would still need to evaluate the numerator's integration, $\int_{\theta \in A} p(\theta)p(\mathcal D|\theta) \mathrm{d}\theta$, which is again generally intractable.

To summarise, the difficulty of Bayesian computation is two-fold:

  1. the posterior distribution is only evaluable up to some unknown normalising constant;

  2. posterior summary involves integrations, which are intractable.

How to estimate? – Monte Carlo

Recall that, to summarise a posterior, we need to calculate intractable integrations such as

$$\mathbb{P}(\theta \in A|\mathcal D) = \int_{\theta \in A} p(\theta |\mathcal D) \mathrm{d}\theta = \frac{ \int_{\theta \in A} p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)}.$$

More generally, we want to calculate expectations of any functions of random variable $\theta$ under the posterior:

$$\mathbb E[t(\theta)|\mathcal D] = \int t(\theta) p(\theta|\mathcal D) \mathrm{d}\theta = \frac{\int t(\theta) p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)} \tag{1}$$

Note that when $t(\cdot)$ is a counting function, e.g.

$$t(\theta) = \mathbf 1(\theta \in A)=\begin{cases} 1 & \theta \in A \\ 0 & \theta \notin A,\end{cases}$$

we recover the first question as

$$\mathbb E[t(\theta)|\mathcal D] = \int \mathbf{1}(\theta\in A) p(\theta|\mathcal D) \mathrm{d}\theta = \int_{\theta \in A} 1\cdot p(\theta|\mathcal D) \mathrm{d}\theta = \mathbb P(\theta\in A|\mathcal D).$$

That's why the expectation problem is more "general".

Therefore, the problem we want to tackle is:

Problem 1: How to estimate expectations of functions of $\theta$, such as equation (1), under the posterior?

Monte Carlo methods are widely known for their capability to approximate intractable integrations. Suppose we can sample from the posterior distribution,

$$\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(R)} \sim p(\theta|\mathcal D),$$

if the sample size $R$ is large enough, then, by the law of large numbers, we can approximate the integration by frequency counting:

$${\mathbb{P}}(\theta \in A|\mathcal D) \approx \frac{\#\{\theta^{(r)} \in A\}}{R},$$

where $\#\{\cdot\}$ counts how many samples fall in the set $A$.

And general expectations of the form

$$\mathbb E[t(\theta)|\mathcal D] = \int t(\theta) p(\theta|\mathcal D) \mathrm{d}\theta = \frac{\int t(\theta) p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)}$$

can also be approximated by its Monte Carlo estimation

$${\mathbb{E}}[t(\theta)|\mathcal D] \approx \hat t =\frac{1}{R} \sum_{r} t(\theta^{(r)}),$$

which is the sample average evaluated at the drawn samples.
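As a quick numerical sketch (separate from the coin example), suppose the posterior happened to be a Beta(8, 4) distribution, an assumption chosen purely because its moments are known in closed form; the Monte Carlo estimate of $\mathbb E[\theta^2|\mathcal D]$ can then be checked against the exact value:

```julia
# Monte Carlo estimation of E[t(θ)|D] with t(θ) = θ²,
# using a Beta(8, 4) stand-in posterior we can sample directly.
using Distributions, Random, Statistics

Random.seed!(42)
post = Beta(8, 4)
R  = 100_000
θs = rand(post, R)                     # θ⁽¹⁾, …, θ⁽ᴿ⁾ ~ p(θ|D)

t̂ = mean(θs .^ 2)                      # Monte Carlo estimate of E[θ²|D]
exact = var(post) + mean(post)^2       # exact second moment, for comparison
```

With $R = 100{,}000$ draws, the sample average lands very close to the exact moment.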

Visual intuition

The following diagram illustrates the idea of Monte Carlo approximation. The true posterior distribution $p(\theta|\mathcal D)$ is plotted on the left, and a histogram of $R=2000$ random samples from the posterior is plotted on the right. The Monte Carlo method essentially uses the histogram on the right in place of the true posterior. For example, to calculate the mean or expectation of $\theta$, i.e. $t(\theta) = \theta$, the Monte Carlo estimator becomes

$$\mathbb E[\theta|\mathcal D] \approx \frac{1}{R} \sum_r \theta^{(r)},$$

the sample average of the samples (which is very close to the ground truth on the left).

begin
    approx_samples = 2000
    dist = MixtureModel([Normal(2, sqrt(2)), Normal(9, sqrt(19))], [0.3, 0.7])
    
    Random.seed!(100)
    x = rand(dist, approx_samples)
    ids = (x .< 15) .& (x .> 0)
    prob_mc=sum(ids)/approx_samples
end;
let
    plt = Plots.plot(
        dist;
        xlabel=raw"$\theta$",
        ylabel=raw"$p(\theta|\mathcal{D})$",
        title="true posterior",
        fill=true,
        alpha=0.3,
        xlims=(-10, 25),
        label="",
        components=false,
    )
    Plots.vline!(plt, [mean(dist)]; label="true mean", linewidth=3)

    
    plt2 = Plots.plot(
        dist;
        xlabel=raw"$\theta$",
        fill=false,
        alpha=1,
        linewidth =3,
        xlims=(-10, 25),
        label="",
        components=false,
    )
        
        
    histogram!(plt2, x, bins=110, xlims=(-10, 25), norm=true, label="", color=1, xlabel=raw"$\theta$", title="Monte Carlo Approx.")
    Plots.vline!(plt2, [mean(x)]; label="MC mean", linewidth=3, color=2)

    Plots.plot(plt, plt2)
end

Similarly, calculating probability, such as $\mathbb P(0\leq \theta \leq 15)$, reduces to frequency counting:

$$\hat{\mathbb{P}}(0\leq \theta \leq 15) = \frac{\#\{0\leq \theta^{(r)}\leq 15\}}{2000} =0.905,$$

counting the proportion of samples that falls in the area of interest. The idea is illustrated in the following diagram.

let
    plt2 = Plots.plot(
        dist;
        xlabel=raw"$\theta$",
        fill=false,
        alpha=1,
        linewidth =3,
        xlims=(-10, 25),
        label="",
        components=false,
        size=(300,450)
    )
        
        
    histogram!(plt2, x, bins = -10:0.3:25 , xlims=(-10, 25), norm=false, label="", color=1, xlabel=raw"$\theta$", title="Monte Carlo est.")

    Plots.plot!(plt2, 0:0.5:15,
        dist;
        xlabel=raw"$\theta$",
        fill=true,
        color = :orange,
    	alpha=0.5,
        linewidth =3,
        xlims=(-10, 25),
        label=L"0 \leq θ \leq 15",
        components=false,
    )

    histogram!(plt2, x[ids], bins = -10:0.3:25, xlims=(-10, 25), norm=false, label="", color=:orange,  xlabel=raw"$\theta$")

end

Monte Carlo method is scalable

The most important property of Monte Carlo method is its scalability, which makes it a practical solution to Problem 1 even when $\theta \in R^D$ is of high dimension.


The accuracy of a Monte Carlo estimate does not depend on the dimensionality $D$ of the space sampled. Roughly speaking, regardless of the dimensionality of $\theta$, the accuracy (the expected squared error of the estimate) remains the same.

Further details about the scalability*

For simplicity, we assume $t$ is a scalar-valued function. Note all expectations here are w.r.t the posterior distribution from which the samples are drawn. Firstly, it can be shown that the Monte Carlo estimator is unbiased:

$$\mathbb E[\hat t] = \mathbb E\left [\frac{1}{R}\sum_r t(\theta^{(r)})\right ] =\frac{1}{R}\sum_r \mathbb E[t(\theta^{(r)})] = \mathbb E[t(\theta)].$$

It means, on average, the estimator converges to the true integration value.

To measure the estimator's accuracy, we only need to find the estimator's variance as it measures the average squared error between the estimator and true value:

$$\mathbb V[\hat t] = \mathbb E[(\hat t -t)^2].$$

If we assume samples are independent draws from the distribution, the variance is then

$$\mathbb V[\hat t] =\mathbb V\left [\frac{1}{R}\sum_r t(\theta^{(r)})\right ]= \frac{1}{R^2} \sum_r \mathbb{V}[t(\theta^{(r)})]=\frac{R\cdot \mathbb{V}[t(\theta^{(r)})]}{R^2}=\frac{\sigma^2}{R},$$

where

$$\sigma^2 = \mathbb V[t] = \int p(\theta|\mathcal D) (t(\theta)-\mathbb E[t])^2\mathrm{d}\theta$$

is some positive constant that only depends on the function $t$. Therefore, as the number of samples $R$ increases, the variance of $\hat t$ decreases linearly in $1/R$ (the standard deviation, $\sqrt{\mathbb V[\hat t]}$, unfortunately, shrinks only at a rate of $1/\sqrt{R}$). Note the accuracy (variance) depends only on $\sigma^2$, the variance of the particular statistic $t$, rather than on $D$, the dimensionality.
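The $\sigma^2/R$ rate can be checked empirically. The sketch below (a standard Normal target and sample sizes of 100 and 400, all arbitrary illustrative choices) estimates the Monte Carlo standard error by repetition; quadrupling $R$ should roughly halve the standard error:

```julia
# Empirical check that the Monte Carlo standard error shrinks like 1/√R.
using Distributions, Random, Statistics

Random.seed!(1)
d = Normal(0, 1)

# standard error of the MC mean estimate, measured over many repetitions
mc_se(R; reps=2000) = std([mean(rand(d, R)) for _ in 1:reps])

se_R  = mc_se(100)     # theory: σ/√R  = 0.1
se_4R = mc_se(400)     # theory: σ/√4R = 0.05
```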

How to sample? – MCMC

In the previous section, we established that Monte Carlo estimation is a scalable method, provided we can obtain samples from the posterior distribution. That is a big "if". Without a practical sampling method, no Monte Carlo estimator can be calculated.

We should also note that for a general Bayesian inference problem the posterior can only be evaluated up to some unknown constant:

$$p(\theta|\mathcal D) \propto p(\theta) p(\mathcal D|\theta),$$

where the scalar $1/p(\mathcal D)$ involves a nasty integration which we do not know how to compute.

The question we face now is a sample generation problem:

Problem 2: how to generate samples $\{\theta^{(r)}\}_{r=1}^R$ from an unnormalised distribution $p(\theta|\mathcal D) \propto p(\theta)p(\mathcal D|\theta)$?

Note. If we take the log of the posterior, the log density becomes a sum of the log prior and the log-likelihood: $\ln p(\theta|\mathcal D) = \ln p(\theta) + \ln p(\mathcal D|\theta) + \text{const},$ which is faster and more numerically stable for computers to compute. Additions are generally faster than multiplications/divisions for floating-point operations, and working in log space avoids numerical underflow when many small probabilities are multiplied.
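The underflow point can be seen in a short sketch; the standard Normal model and the artificially extreme observations below are illustrative assumptions:

```julia
# Multiplying many small likelihood terms underflows Float64,
# while summing their logs remains perfectly representable.
using Distributions

d  = Normal(0, 1)
xs = fill(10.0, 50)                        # 50 very unlikely observations

naive = prod(pdf(d, x) for x in xs)        # underflows to 0.0
ℓ     = sum(logpdf(d, x) for x in xs)      # finite log-likelihood
```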

Basic idea

A class of methods called Markov chain Monte Carlo (MCMC) is a popular and successful way to generate samples from such non-standard distributions.

The idea is to produce posterior samples $\{\theta^{(r)}\}_{r=1}^R$ in sequence, each depending only on its predecessor $\theta^{(r-1)}$ and not on the more distant history, i.e. a Markov chain (which accounts for the first "MC" of the acronym MCMC). When the transition probability of the chain satisfies certain conditions, Markov chain theory guarantees that, under quite general conditions, the empirical distribution of the simulated samples approaches the desired target distribution as the chain is simulated long enough, i.e. when $R$ is large.

Since the sequence converges to the target distribution only when $R$ is large, we usually retain only the last chunk of the sequence as "good samples", or equivalently discard the initial samples as burn-in, since they are usually not representative of the distribution being approximated. For example, if the Markov chain has been simulated for 4,000 steps, we keep only the last 2,000 to form an empirical distribution of the target.

Metropolis Hastings

The Metropolis-Hastings (MH) algorithm is one of the MCMC methods (and arguably the most important one). MH constructs a Markov chain in two simple steps: propose a candidate state from a proposal distribution $q$, then accept or reject it with a carefully chosen probability.

A key observation to note here is that MH's operations do not involve the normalising constant:

$$\frac{p(\theta_{\text{proposed}}|\mathcal D)}{p(\theta_{\text{current}}|\mathcal D)} = \frac{\frac{p(\theta_{\text{proposed}})p(\mathcal D|\theta_{\text{proposed}})}{\cancel{p(\mathcal D)}}}{\frac{p(\theta_{\text{current}})p(\mathcal D|\theta_{\text{current}})}{\cancel{p(\mathcal D)}}} = \frac{p(\theta_{\text{proposed}})p(\mathcal D|\theta_{\text{proposed}})}{p(\theta_{\text{current}})p(\mathcal D|\theta_{\text{current}})} \triangleq \frac{p^\ast(\theta_{\text{proposed}})}{p^\ast(\theta_{\text{current}})}.$$

Therefore, we only need to evaluate the un-normalised posterior distribution $p^\ast$.

The algorithm details are summarised below.

Metropolis-Hastings algorithm
  1. Initialise $\theta^{(0)}$ arbitrarily

  2. For $t = 0,1,2,\ldots$:

    • sample a candidate value from $q$:

    $$\theta' \sim q(\theta|\theta^{(t)})$$

    • evaluate $a$, where

    $$a = \text{min}\left \{\frac{p^\ast(\theta')q(\theta^{(t)}|\theta')}{p^\ast(\theta^{(t)}) q(\theta'|\theta^{(t)})}, 1\right \}$$

    • set

    $$\theta^{(t+1)} = \begin{cases} \theta' & \text{with probability } a\\ \theta^{(t)} & \text{with probability } 1-a\end{cases}$$

Remark: when the proposal distribution is symmetric, i.e.

$$q(\theta^{(t)}|\theta')= q(\theta'|\theta^{(t)}),$$

the acceptance probability reduces to

$$a = \text{min}\left \{\frac{p^\ast(\theta')}{p^\ast(\theta^{(t)}) }, 1\right \}$$

and the algorithm is called the Metropolis algorithm. A common symmetric proposal distribution is the Gaussian random walk, i.e. $q(\theta'|\theta^{(t)})= \mathcal N(\theta^{(t)}, \Sigma),$ where $\Sigma$ is some fixed covariance matrix.

Question: sample a biased 🎲 by Metropolis Hasting

We want to obtain samples from a completely biased die that lands on three (⚂) all the time. Suppose one uses the Metropolis-Hastings algorithm [1] to generate samples from this biased die, with a fair die as the proposal distribution. What will the simulated chain look like?

Answer

Depending on the initial state, the chain can either be $3,3,3,\ldots$ or $x,x,\ldots,x,3,3,3,\ldots$ where $x \in \{1,2,4,5,6\}$.

The proposal is symmetric: since a fair die is used as the proposal, the proposal probability is $1/6$ for each of the six faces.

As a result, the acceptance probability $a$ only depends on the ratio of the target distribution. There are two possible scenarios:

  1. if the chain starts at 3, then all subsequent proposals other than 3 will be rejected (as $a=\tfrac{0}{1}=0\%$). Therefore, only samples of 3 will be produced.

  2. if the chain is initialised with a state other than 3, e.g. $x=1$, then all proposals other than 3 will be rejected [2], so the initial samples will all be ones ($1,1,1,\ldots$), until a three is proposed, which is accepted with probability $a=100\%$ (the acceptance ratio exceeds one); from then on, only threes are produced (by the same argument as case 1).

Regardless of the starting state, the chain will eventually converge and produce the correct samples, i.e. $3,3,3,3,\ldots$. For chains starting with a state other than 3, the initial chunk (all ones in the previous case) should be discarded as burn-in: the chain has not yet converged.

Luckily, the discarded initial chunk will be short. On average, we should expect the burn-in to last only about 6 iterations, as its length is a geometric random variable with $p=1/6$.
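The answer can be checked by simulation. The sketch below, with a hypothetical `die_chain` helper, implements the Metropolis chain for the biased die (a proposal is accepted only when the target gives it non-zero probability, i.e. when a three is proposed) and estimates the average burn-in length:

```julia
# Metropolis chain for the completely biased die with a fair-die proposal.
using Random, Statistics

Random.seed!(2024)
p_star(x) = x == 3 ? 1.0 : 0.0            # unnormalised target: all mass on 3

function die_chain(steps; x0=1)
    x, chain = x0, Int[]
    for _ in 1:steps
        xstar = rand(1:6)                  # symmetric fair-die proposal
        if p_star(xstar) > 0               # accept only proposals of 3
            x = xstar
        end
        push!(chain, x)
    end
    chain
end

# average burn-in over many restarts from state 1: geometric, mean ≈ 6
burnins = [findfirst(==(3), die_chain(200)) for _ in 1:5000]
mean_burnin = mean(burnins)
```

Averaged over many restarts, the first three indeed appears around iteration 6.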

Demonstration

To demonstrate the idea, consider sampling from a bivariate Gaussian distribution:

$$\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \sim \mathcal N\left (\begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right ).$$

The Gaussian distribution has a mean of zero and variances of 1; $\rho$ measures the correlation between the two dimensions. Here we use $\rho = 0.9$. The probability density can be written as

$$p^\ast(\boldsymbol \theta) \propto \text{exp}\left (-\frac{1}{2} \boldsymbol \theta^\top \mathbf \Sigma^{-1} \boldsymbol \theta\right ),$$

where $\mathbf \Sigma = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}.$ The target distribution's contour plot is shown below.

begin
    ρ = 0.9
    μ = [0, 0]
    Σ = [1.0 ρ; ρ 1]
    target_p = (x) -> logpdf(MvNormal(μ, Σ), x)
    plt_contour = plot(-5:0.1:5, -5:0.1:5, (x, y)-> exp(target_p([x,y])),legend = :none, st=:contour, linewidth=0.5, la=0.5, levels=5, xlabel=L"\theta_1", ylabel=L"\theta_2", title="Contour plot of a highly correlated bivariate Gaussian")
end

To apply an MH algorithm, we adopt a simple uncorrelated bivariate Gaussian as the proposal distribution:

$$q(\theta'|\theta^{(t)}) = \mathcal N\left (\theta^{(t)}, \sigma^2_q \times \begin{bmatrix} 1 & 0 \\ 0 & 1\end{bmatrix}\right )$$

$$a = \text{min}\left \{\frac{p^\ast(\theta')}{p^\ast(\theta^{(t)})}, 1\right \}$$

The Metropolis-Hastings algorithm with the mentioned proposal distribution can be implemented in Julia as follows:

"""
    metropolis_hastings(ℓπ, mc=500; dim=2, Σ, x0)

Metropolis-Hastings with isotropic Gaussian proposal

- `ℓπ`: the target log probability density
- `mc`: number of Monte Carlo samples to draw
- `dim`: the dimension of the target space
- `Σ`: proposal's covariance matrix
- `x0`: initial starting state
"""
function metropolis_hastings(ℓπ, mc=500; dim=2, Σ = 10. * Matrix(I, dim, dim), x0=zeros(dim))
	samples = zeros(dim, mc)
	samples[:, 1] = x0
	L = cholesky(Σ).L
	ℓx0 = ℓπ(x0) 
	accepted = 0
	for i in 2:mc
		# randomly draw from the proposal distribution 
		# faster than xstar = rand(MvNormal(x0, Σ))
		xstar = x0 + L * randn(dim)
		ℓxstar = ℓπ(xstar)
		# calculate the acceptance ratio `a`
		# note ℓπ is in log scale, need to apply exp
		a = exp(ℓxstar - ℓx0) 
		# if accepted
		if rand() < a
			x0 = xstar
			ℓx0 = ℓxstar
			accepted += 1
		end
		# store the sample 
		@inbounds samples[:, i] = x0
	end
	accpt_rate = accepted / (mc-1)
	return samples, accpt_rate
end

To use the algorithm to draw samples from the target Gaussian distribution, we simply feed in the required inputs, e.g.

	# set the target posterior distribution to sample
	target_ℓπ = (x) -> logpdf(MvNormal(μ, Σ), x)
	# set proposal distribution's variance
	σ²q = 1.0
	mh_samples, rate = metropolis_hastings(target_ℓπ, 2000; Σ = σ²q * Matrix(I, 2, 2), x0=[-2.5, 2.5])

The animation below demonstrates the MH algorithm with a proposal variance of $\sigma^2_q = 1.0$.

begin
    anim_σ_1 = demo_mh_gif(target_p; qΣ = 1.0 * Matrix(I,2,2), x₀ = [-2, 2])
    gif(anim_σ_1, fps=5)
end

It can be observed that the chain, regardless of its starting location, quickly moves into the high-probability region and then explores it.

2000 samples drawn from the MH algorithm after 2000 burn-in iterations are shown below.

begin
    Random.seed!(100)
    σ²q = 1.0
    mh_samples = metropolis_hastings(target_p, 4000; Σ = σ²q * Matrix(I,2,2), x0=[-2, 2])[1]
end;
begin
    plt = covellipse(μ, Σ,
    n_std=1.64, # 5% - 95% quantiles
    xlims=(-4, 4), ylims=(-4, 4),
    alpha=0.3,
    c=:steelblue,
    label="90% HPD",
    xlabel=L"\theta_1", ylabel=L"\theta_2")

    scatter!(plt, mh_samples[1, 2001:end], mh_samples[2, 2001:end], label="MH samples", mc=:red, ma=0.3)

end

Reduce dependence of the chain

Ideally, the traditional Monte Carlo method requires samples to be independent. MCMC samples, however, are simulated as Markov chains: i.e. the next sample depends on the current state, therefore MCMC samples are all dependent.

It is worth noting that Monte Carlo estimation is still valid even when the samples are dependent as long as the chain reaches equilibrium. However, an estimation's standard error, in general, deteriorates when the samples in use are more dependent.

There are two commonly used practices to reduce the temporal correlations of an MCMC chain: thinning and running parallel chains. The two can be combined to further improve sample quality.

Thinning

Since the dependence between samples decreases as the lag between them increases, one can retain MCMC samples only at some fixed lag: for example, after convergence, save every 10th sample.
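A small sketch of thinning, using a synthetic AR(1) series as a stand-in for correlated MCMC output (an illustrative assumption, since any real chain would do): keeping every 10th draw sharply reduces the lag-1 autocorrelation.

```julia
# Thinning a correlated chain: retain every 10th sample.
using Random, Statistics

Random.seed!(7)
n, ρ = 50_000, 0.9
chain = zeros(n)
for t in 2:n
    chain[t] = ρ * chain[t-1] + randn()   # strongly autocorrelated draws
end

thinned = chain[1:10:end]                 # keep every 10th sample

# lag-1 autocorrelation before and after thinning
ac(x) = cor(x[1:end-1], x[2:end])
ac_full, ac_thin = ac(chain), ac(thinned) # ≈ ρ versus ≈ ρ¹⁰
```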

Parallel chains

Each MCMC chain simulates independently. To fully exploit modern computers' concurrent processing capabilities, we can run several chains in parallel.

Why should we run parallel chains rather than a single long chain? Ideally, we want the samples to be independent, but samples from a single chain are temporally dependent. Running multiple chains that explore the posterior landscape independently and then pooling their samples yields a less dependent overall sample.

The following animation shows the evolution of four parallel Metropolis-Hastings chains. To make the visualisation cleaner, rejected proposals are not shown in the animation. The four chains start at four initial locations (i.e. the four corners). Note that all four chains quickly move to the high-density regions regardless of their starting locations, which is a key property of MCMC.

begin
    Random.seed!(100)
    # simulate 4000 MCMC steps for each chain
    mc_iters = 4000
    # pre-allocate storage for the chains
    mh_samples_parallel = zeros(mc_iters, 2, 4)
    # four starting locations
    x0s = [[-1, -1], [-1, 1], [1,1], [1, -1]] .* 3.5
    # simulate 4 chains in parallel
    Threads.@threads for t in 1:4
        mh_samples_parallel[:, :, t] = metropolis_hastings(target_p, mc_iters; Σ = σ²q * Matrix(I, 2, 2), x0=x0s[t])[1]'
    end
end
let
    # create a plot of the high probability density for the target distribution
    plt_anim = covellipse(μ, Σ,
    n_std=1.64, # 5% - 95% quantiles
    xlims=(-4, 4), ylims=(-4, 4),
    alpha=0.3,
    c=:steelblue,
    legend = :outerright,
    label="90% HPD",
    xlabel=L"\theta_1", ylabel=L"\theta_2", title="Animation of 4 Parallel MH Chains")

    labels = "chain " .* string.(1:4)
    # create the animation for the first 100 steps
    mh_anim_parallel = @animate for i  in 1:99
        for ch in 1:4
            scatter!(plt_anim, (mh_samples_parallel[i, 1, ch], mh_samples_parallel[i, 2, ch]),
                 label= i == 1 ? labels[ch] : false, mc=ch, ma=0.5)
            plot!(plt_anim, mh_samples_parallel[i:i+1, 1, ch], mh_samples_parallel[i:i+1, 2, ch], st=:path, lc=ch, la=0.5, label=false)
        end
        
    end
    # show the animation at 10 frames per second
    gif(mh_anim_parallel, fps= 10)
end

Limitations of Metropolis-Hastings

The MH algorithm's performance depends on the proposal's quality, and setting a suitable proposal distribution for an MH algorithm is not easy, especially when the target distribution is high-dimensional. As a rule of thumb, we should aim for an acceptance rate between 0.2 and 0.6; a rate that is too high or too low signals that the chain is struggling.

Acceptance rate too high

This usually happens when $\sigma^2_q$ is too small: proposals all stay close to each other within one high-density area $\Rightarrow$ most of the proposals are accepted, but, as a result $\Rightarrow$ other high-density areas are not explored sufficiently.

The following animation illustrates the idea: the proposal's variance is set as $\sigma_q^2 = 0.005$; despite a high acceptance rate of 0.924, the chain only makes tiny moves and therefore does not explore the target distribution well enough.

begin
    anim_σ_001 = demo_mh_gif(target_p; qΣ = 0.005 * Matrix(I,2,2), x₀ = [-2, -2], mc = 500)
    gif(anim_σ_001, fps=5)
end

Acceptance rate too low

This usually happens when $\sigma^2_q$ is too large: the proposals jump further but are likely to land in low-density locations $\Rightarrow$ many rejections $\Rightarrow$ highly temporally correlated samples.

The chain shown below employs a more audacious proposal distribution with $\sigma_q^2 = 10.0$. As a result, most of the proposals are rejected, which leads to an inefficient sampler (many consecutive samples are identical).

begin
    anim_σ_15 = demo_mh_gif(target_p; qΣ = 10 * Matrix(I,2,2), x₀ = [-2, -2], mc = 500)
    gif(anim_σ_15, fps=5)
end

To avoid the difficulty of setting good proposal distributions, more advanced variants of the MH algorithm, such as the Hamiltonian Monte Carlo sampler (HMC) and the No-U-Turn sampler (NUTS), have been proposed. Instead of randomly exploring the target distribution, HMC and its variants employ the gradient information of the target distribution to help explore the landscape.

Digression: Julia's MCMCChains.jl

Julia has a wide range of packages for analysing MCMC chains. In particular, the MCMCChains.jl package provides tools to visualise and summarise MCMC chains.

We can construct a Chains object using MCMCChains.jl by passing a matrix of raw chain values along with optional parameter names:

The initial 3 samples of the raw samples:

mh_samples_parallel[1:3, :, :]
3×2×4 Array{Float64, 3}:
[:, :, 1] =
 -3.5  -3.5
 -3.5  -3.5
 -3.5  -3.5

[:, :, 2] =
 -3.5      3.5
 -3.5      3.5
 -3.61485  2.23203

[:, :, 3] =
 3.5      3.5
 3.5      3.5
 3.47994  2.45702

[:, :, 4] =
 3.5      -3.5
 2.83202  -2.32634
 2.83202  -2.32634

Create a Chains object by passing the raw sample matrix and, optionally, the names of the parameters:

# use MCMCChains.jl to create a Chains object 
# the first 2000 samples are discarded as burn-in
# Note that to apply thinning=2, one can use ``mh_samples_parallel[2001:2:end, :, :]``
chain_mh = Chains(mh_samples_parallel[2001:end, :, :], [:θ₁, :θ₂])

Visualisations

One can visualise Chains objects by passing them directly to the plot() function.

For example, the following plots show the four parallel MH chains for the bi-variate Gaussian example

It can be seen visually that all four chains converge to roughly the same equilibrium distributions.

begin
    chain_mh = Chains(mh_samples_parallel[2001:end, :, :], [:θ₁, :θ₂])
    plot(chain_mh)
end

Other visualisation methods are also available. For example, the following plot uses corner() to show a pairwise scatter plot of the chain.

begin
    corner(chain_mh)
end

Summary statistics

MCMCChains.jl also provides easy-to-use interfaces to tabulate important statistics of MCMC chain instances.

# summary of the chain
summarystats(chain_mh)
describe(chain_mh)

summarystats(chain_mh) returns:

  parameters      mean       std   naive_se       mcse      ess     rhat
          θ₁  0.109403  0.991057  0.0110803  0.0477931   363.28  1.00179
          θ₂  0.119246  0.995301  0.0111278  0.0482049   368.43  1.00163

describe(chain_mh) returns:

  2-element Vector{ChainDataFrame}:
   Summary Statistics (2 x 7)
   Quantiles (2 x 6)

MCMC diagnosis

Markov chain theory only provides us with a theoretical guarantee:

if one simulates an MCMC chain long enough, eventually the sample empirical distribution will converge to the target posterior distribution.

Unfortunately, this is only a theoretical guarantee. We do not have the time and resources to run a chain indefinitely. In practice, we simulate chains of fixed lengths and inspect them to check convergence.

MCMC metrics

Luckily, there are multiple easy-to-compute metrics to diagnose a chain. Most of the techniques apply stationary time series models to assess the performance of a chain. The most commonly used metrics are the $\hat R$ statistic and the effective sample size (ess).

R̂ statistic (rhat)

R̂ is a metric that measures the stability of a chain. Upon convergence, any chunks of a chain, e.g. its first and last halves, or parallel chains, should be similar to each other. The $\hat R$ statistic, defined by comparing the between-chain and within-chain variability, should be close to one if all chains have converged.

As a rule of thumb, a valid converging chain's $\hat R$ statistic should be less than 1.01.
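A minimal split-$\hat R$ sketch under the description above: split each chain into halves and compare between-half and within-half variability. This is a simplified illustration; packages such as MCMCChains.jl implement more refined (rank-normalised) variants.

```julia
# Simplified split-R̂: ratio of pooled to within-chain variance.
using Statistics

function split_rhat(chains::AbstractMatrix)   # iterations × chains
    n = size(chains, 1) ÷ 2
    halves = hcat(chains[1:n, :], chains[n+1:2n, :])   # split each chain in half
    means  = vec(mean(halves; dims=1))
    W = mean(var(halves; dims=1))             # within-chain variance
    B = n * var(means)                        # between-chain variance
    v̂ = (n - 1) / n * W + B / n              # pooled variance estimate
    sqrt(v̂ / W)                              # ≈ 1 upon convergence
end
```

Well-mixed chains sampling the same distribution give values near 1; chains stuck at different locations give values well above 1.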

Effective sample size (ess)

The traditional Monte Carlo method requires samples to be independent. However, MCMC samples are all dependent, as the chain is simulated in a Markovian fashion: the next sample depends on the previous state. It is worth noting that Monte Carlo estimation is still valid even when the samples are dependent, as long as the chain has mixed well. However, the standard error of the estimation, in general, deteriorates when the samples are more dependent.

ess roughly estimates how many equivalent independent samples are in a dependent chain. Since samples in an MCMC chain are correlated, they contain less information than truly independent samples. ess therefore discounts the sample size by some factor that depends on the temporal correlation between the iterations.
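This discounting can be sketched with the common approximation $\text{ess} \approx R / (1 + 2\sum_k \rho_k)$, where $\rho_k$ is the lag-$k$ autocorrelation. The truncation rule and the AR(1) test chain below are illustrative simplifications; library implementations (e.g. MCMCChains.jl's ess) are more careful:

```julia
# Rough effective sample size: discount R by the summed autocorrelations.
using Random, Statistics

Random.seed!(3)

autocor_at(x, k) = cor(x[1:end-k], x[k+1:end])   # lag-k sample autocorrelation

function ess_rough(x; maxlag=200)
    s = 0.0
    for k in 1:maxlag
        ρk = autocor_at(x, k)
        ρk < 0.05 && break            # truncate once the correlation is tiny
        s += ρk
    end
    length(x) / (1 + 2s)
end

# a sticky AR(1) chain: far fewer effective samples than raw draws
R, ρ = 20_000, 0.9
x = zeros(R)
for t in 2:R
    x[t] = ρ * x[t-1] + randn()
end
ess_eff = ess_rough(x)                # much smaller than R = 20,000
```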

We can also measure the efficiency of an MCMC algorithm by calculating

$$\text{efficiency} = \frac{\text{ess}}{\text{iterations}},$$

which measures the information content of a chain. A more efficient algorithm needs fewer iterations to achieve the same estimation accuracy.

A larger effective sample size (ess) implies a more efficient MCMC algorithm.

Examples. To demonstrate the ideas, we compare the ess and $\hat R$ of three MH chains with different proposal variances: $\sigma^2_q = 1.0$, $0.005$ and $20$. Remember, $\sigma^2_q=1.0$ performs the best; $0.005$ and $20$ are less optimal.

Summary statistics with MH chain of $\sigma^2_q=1.0$

summarystats(chain_mh)

  parameters      mean       std   naive_se       mcse      ess     rhat
          θ₁  0.109403  0.991057  0.0110803  0.0477931   363.28  1.00179
          θ₂  0.119246  0.995301  0.0111278  0.0482049   368.43  1.00163
begin
    # simulate two chains with two different proposal variances
    Random.seed!(100)
    σ²q_too_small = 0.005
    σ²q_too_big = 20.0
    mh_samples_too_small = zeros(mc_iters, 2, 4)
    mh_samples_too_big = zeros(mc_iters, 2, 4)
    mh_samples_ideal = zeros(mc_iters, 2, 4)
    Threads.@threads for t in 1:4
        mh_samples_too_small[:, :, t] = metropolis_hastings(target_p, mc_iters; Σ = σ²q_too_small * Matrix(I, 2, 2), x0=x0s[t])[1]'

        mh_samples_too_big[:, :, t] = metropolis_hastings(target_p, mc_iters; Σ = σ²q_too_big * Matrix(I, 2, 2), x0=x0s[t])[1]'

        mh_samples_ideal[:, :, t] = metropolis_hastings(target_p, mc_iters; Σ = Σ, x0=x0s[t])[1]'
    end
    chain_too_small = Chains(mh_samples_too_small[2001:end, :, :], [:θ₁, :θ₂])
    chain_too_big = Chains(mh_samples_too_big[2001:end, :, :], [:θ₁, :θ₂])
    chain_ideal = Chains(mh_samples_ideal[2001:end, :, :], [:θ₁, :θ₂])
end;

Summary statistics with MH chain of $\sigma^2_q=.005$

summarystats(chain_too_small)
parameters      mean       std   naive_se       mcse      ess     rhat
       :θ₁  0.351195  0.941285  0.0105239  0.0970750  20.0356  1.88365
       :θ₂  0.398657  0.940385  0.0105138  0.0970787  19.4734  1.95291

Summary statistics with MH chain of $\sigma^2_q=20.0$

summarystats(chain_too_big)
parameters      mean       std   naive_se       mcse      ess     rhat
       :θ₁  0.156851  0.971666  0.0108636  0.0669214  134.299  1.02462
       :θ₂  0.145832  0.983643  0.0109975  0.0687366  134.955  1.02730

It can be observed that the chain with $\sigma^2_q=1.0$ attains the largest ess (around 365) with rhat close to 1; $\sigma^2_q=20$ comes second (ess around 134); and $\sigma^2_q=.005$ performs worst, with an ess of around 20 and rhat far above 1.

Visual inspection

Simple trace plots reveal a lot of information about a chain. One can diagnose a chain by visual inspection.

What do bad chains look like? The following two trace plots show the chain traces of the two less desirable MH algorithms for the bivariate Gaussian example: one with $\sigma^2_q=0.005$ and the other with $\sigma^2_q=20$.

Recall that when $\sigma^2_q$ is too small, the chain proposes small changes at each iteration. As a result, the chain does not explore the HPD region sufficiently well. If one splits the chain into two halves, the two chunks (with green and red backgrounds) exhibit drastically different values.

begin
plot(1:1000, Array(chain_too_small[1:1000,1:1, 1:1]), title="A bad chain; proposal variance " * L"\sigma^2_q = 0.005", label="", color=1, xlabel="Iteration", ylabel="Sample value", size=(600,300))
vspan!([0,1000], color = :green, alpha = 0.2, labels = "");
plot!(1001:2000, Array(chain_too_small[1001:end, 1:1, 1:1]),  label="", color=1)
vspan!([1001,2000], color = :red, alpha = 0.2, labels = "");
end

On the other hand, when $\sigma^2_q$ is too large, the chain attempts large jumps at each iteration, resulting in most proposals being rejected and old samples being retained. The chain becomes very sticky, and contains much less information compared with independent samples.

plot(Array(chain_too_big[:,1:1, 1:1]), title="A bad chain; proposal variance "*L"\sigma^2_q = 20", xlabel="Iteration", ylabel="Sample value", label="", size=(600,300))
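The stickiness can be checked numerically. The following minimal 1-D random-walk MH sketch (a simplified stand-in with a standard Gaussian target, not the notebook's `metropolis_hastings`; the helper name `mh_accept_rate` is made up) compares the acceptance rate across the three proposal scales used above:

```julia
using Random
Random.seed!(1)

# Acceptance rate of random-walk MH on a standard Gaussian target,
# as a function of the proposal standard deviation σq.
function mh_accept_rate(σq; iters = 10_000)
    x, ℓx = 0.0, 0.0           # ℓx: log-density up to an additive constant
    accepted = 0
    for _ in 1:iters
        x′ = x + σq * randn()  # symmetric Gaussian random-walk proposal
        ℓx′ = -x′^2 / 2
        if rand() < exp(ℓx′ - ℓx)   # MH acceptance test
            x, ℓx = x′, ℓx′
            accepted += 1
        end
    end
    return accepted / iters
end

# Tiny steps: almost always accepted (but the chain barely moves);
# huge steps: mostly rejected (the chain gets stuck).
mh_accept_rate(0.005), mh_accept_rate(1.0), mh_accept_rate(20.0)
```

With a step size that is too small the acceptance rate approaches 1 while the chain hardly moves; with a step size that is too large most proposals are rejected, which is exactly the sticky behaviour seen in the trace above.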

What should good chains look like? The figure below shows trace plots of four different chains with ess = 4.6, 49.7, 90.9, and 155. The top two are bad chains; the bottom two are of relatively better quality. A good chain should show stable statistical properties across the iterations with a reasonable level of variance.

let
    ess = zeros(4)
    ess[1] = MCMCChains.ess_rhat(chain_too_small[:,1:1,1:1])[:,:ess]
    ess[2] = MCMCChains.ess_rhat(chain_too_big[:,1:1,1:1])[:,:ess]
    ess[3] = MCMCChains.ess_rhat(chain_mh[:,1:1,1:1])[:,:ess]
    ess[4] = MCMCChains.ess_rhat(chain_ideal[:,1:1,1:1])[:,:ess]
    ess= round.(ess; digits=1)
    p1 = plot(chain_too_small[:, 1, 1], title="ess=$(ess[1])", label="")
    p2 = plot(chain_too_big[:,1,1], title="ess=$(ess[2])", label="")
    p3 = plot(chain_mh[:,1,1], title="ess=$(ess[3])", label="")
    p4 = plot(chain_ideal[1:end, 1, 1], title="ess=$(ess[4])", label="")
    Plots.plot(p1, p2, p3, p4)
end

Other MCMC samplers*

Metropolis-Hastings is considered the mother of most modern MCMC algorithms: many other MCMC variants can be viewed as specific cases of the MH algorithm. We will quickly introduce two other popular variants, Gibbs sampling and the Hamiltonian sampler, focusing on the intuition behind each.

Gibbs sampling

Gibbs sampling reduces a multivariate sampling problem to a series of univariate samplings. Assume the target distribution is bivariate with density $p(\theta_1, \theta_2)$, and the chain is currently at state $(\theta_1^{(r)}, \theta_2^{(r)})$. Gibbs sampling alternately samples from the full conditionals in two steps:

  1. sample $\theta_1^{(r+1)} \sim p(\theta_1|\theta_2^{(r)})$;

  2. sample $\theta_2^{(r+1)} \sim p(\theta_2|\theta_1^{(r+1)})$.

The new state $(\theta_1^{(r+1)}, \theta_2^{(r+1)})$ is retained as a new sample.

Visual intuition

The following diagram demonstrates how Gibbs sampling explores the bivariate Gaussian distribution. Note the zig-zag behaviour: at each step, Gibbs sampling only changes one dimension of the sample.

begin
    Random.seed!(100)
    # run Gibbs sampling  for the bivariate Gaussian example
    # check the appendix for the details of the implementation
    # 4000 iterations in total and starting location at x0
    samples_gibbs = gibbs_sampling(4000; x0=[-2.5, 2.5])

end;
let
    plt_gibbs = covellipse(μ, Σ,
        n_std=1.64, # 5% - 95% quantiles
        xlims=(-4, 4), ylims=(-4, 4),
        alpha=0.3,
        c=:steelblue,
        label="90% HPD",
        xlabel=L"\theta_1", ylabel=L"\theta_2")
    # create the animation
    gibbs_anim = @animate for i in 1:200
        scatter!(plt_gibbs, (samples_gibbs[1, i], samples_gibbs[2, i]),
                 label=false, mc=:red, ma=0.5)
        plot!(samples_gibbs[1, i:i + 1], samples_gibbs[2, i:i + 1], seriestype=:path,
              lc=:green, la=0.5, label=false)
    end
    gif(gibbs_anim, fps=10)
end

Multivariate Gibbs sampling The general Gibbs sampling algorithm for a multivariate problem is listed below. The idea is the same: at each iteration, a series of univariate draws from the full conditionals are made and chained together.

Gibbs sampling
  1. Initialise $\boldsymbol \theta^{(0)}=[\theta_1^{(0)},\theta_2^{(0)}, \ldots, \theta_D^{(0)} ]$ arbitrarily

  2. For $r = 1,2,\ldots$:

    • sample dimension $1$:

    $$\theta_1^{(r)} \sim p(\theta_1|\theta_2^{(r-1)}, \ldots, \theta_D^{(r-1)})$$

    • sample dimension $2$:

    $$\theta_2^{(r)} \sim p(\theta_2|\theta_1^{(r)}, \theta_3^{(r-1)}, \ldots, \theta_D^{(r-1)})$$

    $$\vdots$$

    • sample dimension $D$:

    $$\theta_D^{(r)} \sim p(\theta_D|\theta_1^{(r)}, \ldots, \theta_{D-1}^{(r)})$$
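The sweep above can be sketched in a few lines. The names `gibbs` and `conds` are made up for this illustration; each entry of `conds` is assumed to be a function that draws one sample from the corresponding full conditional. Here the full conditionals are those of a standard bivariate Gaussian with correlation $\rho = 0.9$, matching the running example.

```julia
using Distributions, Random, Statistics
Random.seed!(1)

# One Gibbs sweep updates each dimension in turn, always conditioning on
# the freshest values of the other dimensions.
function gibbs(conditionals, θ0, iters)
    D = length(θ0)
    θ = copy(θ0)
    samples = zeros(D, iters)
    for r in 1:iters
        for d in 1:D
            θ[d] = conditionals[d](θ)   # draw θ_d | θ_{-d}
        end
        samples[:, r] = θ               # keep the full vector once per sweep
    end
    return samples
end

# Full conditionals of a standard bivariate Gaussian with correlation ρ:
# θ₁ | θ₂ has mean ρθ₂ and variance 1 - ρ² (Normal takes the std, hence sqrt),
# and symmetrically for θ₂ | θ₁.
ρ = 0.9
conds = [θ -> rand(Normal(ρ * θ[2], sqrt(1 - ρ^2))),
         θ -> rand(Normal(ρ * θ[1], sqrt(1 - ρ^2)))]
xs = gibbs(conds, zeros(2), 5000)
mean(xs, dims=2), cor(xs[1, :], xs[2, :])   # ≈ zero means, correlation ≈ ρ
```

Note that no acceptance test appears anywhere: as shown below, a Gibbs proposal is always accepted.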

Gibbs sampling is a Metropolis-Hastings algorithm

One drawback of the MH sampler is that the user needs to specify a proposal distribution. Gibbs sampling relieves the user of this burden by using the full conditionals of the target distribution as the proposal. Gibbs sampling, therefore, is a specific case of the MH algorithm. One can also show that when the full conditionals are used, the acceptance probability is always 100%; therefore, there is no rejection step in Gibbs sampling.

Foldable("Further details on Gibbs sampling is a MH algorithm.", md"For simplicity, we only consider bivariate case. Assume the chain is currently at state ``\boldsymbol \theta^{(r)}=(\theta_1^{(r)}, \theta_2^{(r)})``, the proposal distribution proposes to move to a new state ``\boldsymbol \theta' = (\theta_1', \theta_2')`` with a proposal density

$$q(\boldsymbol \theta'|\boldsymbol \theta^{(r)}) = p(\theta_1'|\theta_2') \cdot \mathbf 1({\theta_2'= \theta_2^{(r)})},$$

where ``\mathbf 1(\cdot)`` returns 1 when the testing condition is true and 0 otherwise. The transition kernel basically states ``\theta_2^{(r)}`` is not changed.


The acceptance probability then is 

$$a \triangleq \frac{p(\boldsymbol \theta') q(\boldsymbol \theta^{(r)}|\boldsymbol \theta')}{p(\boldsymbol \theta^{(r)})q(\boldsymbol \theta'|\boldsymbol \theta^{(r)})} = \frac{p(\theta_1', \theta_2') \times p(\theta_1^{(r)}|\theta_2')\cdot \mathbf 1({\theta_2'= \theta_2^{(r)})}}{p(\theta_1^{(r)}, \theta_2^{(r)})\times p(\theta_1'|\theta_2')\cdot \mathbf 1({\theta_2'= \theta_2^{(r)})}}.$$

When ``\theta_2' = \theta_2^{(r)}``,

$$a = \frac{p(\theta_1', \theta_2^{(r)}) \times p(\theta_1^{(r)}|\theta_2^{(r)})}{p(\theta_1^{(r)}, \theta_2^{(r)})\times p(\theta_1'|\theta_2^{(r)})}=  \frac{p(\theta_1', \theta_2^{(r)}) \times \frac{p(\theta_1^{(r)}, \theta_2^{(r)})}{p(\theta_2^{(r)})}}{p(\theta_1^{(r)}, \theta_2^{(r)})\times \frac{p(\theta_1',\theta_2^{(r)})}{p(\theta_2^{(r)})}} = \frac{p(\theta_2^{(r)})}{p(\theta_2^{(r)})} =1.0,$$

when ``\theta_2'\neq \theta_2^{(r)}``, we have ``a=\tfrac{0}{0}=1``, a trivial case. Therefore, there is no need to test the proposal. The proposed state should also be accepted.
")

We have already shown that a high acceptance rate does not necessarily imply high efficiency, and the same applies to Gibbs sampling. The zig-zag exploration scheme of Gibbs sampling can be slow for highly correlated sample spaces. For example, in our bivariate Gaussian example, Gibbs sampling achieves around ess = 129 (out of 2000 samples) and an efficiency of around 0.0645.

Summary statistics of Gibbs sampling

begin
    gibbs_chain = Chains(samples_gibbs[:, 2001:end]', [:θ₁, :θ₂])
    summarystats(gibbs_chain)
end
parameters       mean       std   naive_se       mcse      ess     rhat
       :θ₁  0.0802593  0.963368  0.0215416  0.0764537  130.184  1.03633
       :θ₂  0.0918141  0.953711  0.0213256  0.0774604  127.312  1.03460

Hamiltonian sampler

We have discussed the limitation of the Metropolis-Hastings algorithm: the proposal distribution is hard to tune. Ideally, we want a proposal distribution with the following properties:

    • it proposes states far away from the current state, so the chain explores the space quickly;

    • the proposed states still fall in high-probability regions, so they are likely to be accepted.

Simple proposals, such as Gaussians, cannot achieve both of the desired properties. The problem lies in their random-walk exploration nature: the proposals blindly propose the next step and ignore the local geographical information of the target distribution. Gradients, which always point in the direction of steepest ascent, provide exactly the geographical information that can guide the proposal to reach a further yet high-probability region.

Visual intuition

To understand the idea, let's consider a more complicated target distribution which is formed by super-imposing three bivariate Gaussians together. The three Gaussians are with means $[-4,0], [4,0],$ and $[0,0]$, and the variances $\begin{bmatrix} 2 & 1.5 \\ 1.5& 2\end{bmatrix}$, $\begin{bmatrix} 2 & -1.5 \\ -1.5 & 2\end{bmatrix}$ and $\begin{bmatrix} 2 & 0 \\ 0 & 2\end{bmatrix}$ respectively.

The contour and surface plots of the target distribution are shown below for reference.

begin
    d1 = MvNormal([-4, 0], [2 1.5; 1.5 2])
    d2 = MvNormal([4, 0], [2 -1.5; -1.5 2])
    d3 = MvNormal([0, 0], [2 0; 0. 2])
    d = MixtureModel([d1, d2, d3])
end;
begin
    ℓ(x) = logpdf(d, x)
    ∇ℓ(x) = ForwardDiff.gradient(ℓ, x)
end;
begin
    con_p = contour(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false, ratio=1, xlim =[-9,9], ylim=[-6,6])
    surf_p = surface(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false)
    Plots.plot(con_p, surf_p, layout=(1,2))
end

A vector field view of the gradient of the target density is shown below. Note that at each location, the gradient (a blue vector) always points in the steepest ascent direction; it therefore provides key geographical information about the landscape. The Hamiltonian sampler proposes the next state by making use of this gradient information.

begin
    meshgrid(x, y) = (repeat(x, outer=length(y)), repeat(y, inner=length(x))) # helper function to create a quiver grid.
    ∇ℓ_(x, y) = ∇ℓ([x, y])/5
    contour(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false, title="Contour plot and gradient plot of the target density")
  		xs,ys = meshgrid(range(-9, stop=9, length=15), range(-6,stop=6,length=11))
    quiver!(xs, ys, quiver=∇ℓ_, c=:blue)
end
Foldable("Julia code for the target distribution and visualisation.", md"By using Julia's `Distributions.jl` and `ForwardDiff.jl`, we can formulate the density and its corresponding gradient function as following:

```julia
d1 = MvNormal([-4, 0], [2 1.5; 1.5 2])
d2 = MvNormal([4, 0], [2 -1.5; -1.5 2])
d3 = MvNormal([0, 0], [2 0; 0. 2])
# Superimpose the three Gaussians together to form a new density
d = MixtureModel([d1, d2, d3])
# log likelihood of the mixture distribution
ℓ(x) = logpdf(d, x)
# gradient of the log pdf
∇ℓ(x) = ForwardDiff.gradient(ℓ, x)
```

The contour and surface of the target distribution can be plotted easily via

```julia
contour(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false, ratio=1, xlim =[-9,9], ylim=[-6,6])
surface(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false)
```

")

A useful physics metaphor

A very useful metaphor for understanding the Hamiltonian sampler is a hockey puck sliding without friction over a surface formed by the negative target density (i.e. flip the surface plot: a mountain intuitively becomes a valley).

surface(-9:0.1:9, -6:0.1:6, (x,y) -> -pdf(d, [x,y]), legend=false, title="Negative of the target density")

At each iteration, a random initial momentum (usually drawn from a Gaussian) is given to the puck; the puck then slides over the landscape to reach the next state.

A numerical algorithm called leapfrog integration is usually used to simulate the movement of the puck. In particular, the trajectory is simulated by a sequence of $T$ steps of length $\epsilon$. The two tuning parameters are:

    • $\epsilon$: the step size of each leapfrog step;

    • $T$: the number of leapfrog steps.

Therefore, the total length of the puck's movement is proportional to $\epsilon \times T$.
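The leapfrog scheme itself is only a few lines. Below is a minimal 1-D sketch, assuming a standard Gaussian target so that the potential energy is $E(x) = x^2/2$ with gradient $\nabla E(x) = x$ (these choices are mine, for illustration only). It shows the key property that makes long HMC proposals acceptable: the total energy $H = p^2/2 + E(x)$ is almost conserved along the simulated trajectory.

```julia
# Leapfrog integration for a 1-D standard Gaussian target:
# E(x) = x²/2, so ∇E(x) = x.
function leapfrog(x, p, ϵ, T)
    g = x                      # current gradient ∇E(x)
    for _ in 1:T
        p -= ϵ * g / 2         # half-step in momentum
        x += ϵ * p             # full step in position
        g = x                  # refresh the gradient
        p -= ϵ * g / 2         # another half-step in momentum
    end
    return x, p
end

x0, p0 = 1.0, 0.5
H0 = p0^2 / 2 + x0^2 / 2       # initial total energy
x1, p1 = leapfrog(x0, p0, 0.1, 50)
H1 = p1^2 / 2 + x1^2 / 2       # energy after T = 50 steps of size ϵ = 0.1
abs(H1 - H0)                   # small: H is nearly conserved along the path
```

Because $\Delta H$ stays tiny, the Metropolis correction $\min(1, e^{-\Delta H})$ accepts even such distant proposals with high probability.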

The animation below demonstrates the idea, where 10 independent HMC proposals are simulated from an initial starting location at $[0, 2.5]$ by the Leapfrog algorithm:

begin
    T_step = 50
    traj_n = 10
    x0 = [0, 2.5]
    hmc_trajs = zeros(2, T_step+1, traj_n) 
    Random.seed!(189)
    for t in 1:traj_n	
        hmc_trajs[:, :, t] = hamiltonian_proposal_step(ℓ, ∇ℓ, x0, 0.1, T_step)
    end
end
let
    hmc_anim_plt = plot(-9:0.1:9, -6:0.1:6, (x,y) -> -pdf(d, [x,y]), st=:contour, legend=false, title="HMC's proposal animation", xlabel=L"\theta_1", ylabel=L"\theta_2")

    scatter!(hmc_anim_plt, [x0[1]], [x0[2]], color= traj_n+1)
    ts = 1:traj_n
    hmc_anim = @animate for i in 1:2:T_step
        for t in ts 
            plot!(hmc_trajs[1, i:i + 1, t], hmc_trajs[2, i:i + 1, t], seriestype=:path,
                      lc=t, la=1.0, lw=3.0, label="")
        end
    end
    gif(hmc_anim, fps=2)
end
let
    hmc_plt = plot(-9:0.1:9, -6:0.1:6, (x,y) -> -pdf(d, [x,y]), st=:contour, legend=false, title="HMC's proposals end snapshot", xlabel=L"\theta_1", ylabel=L"\theta_2")
    
    scatter!([x0[1]], [x0[2]], color= traj_n+1)
    for t in 1:traj_n
        for i in 1:2:T_step
            plot!(hmc_trajs[1, i:i + 1, t], hmc_trajs[2, i:i + 1, t], seriestype=:path,
                      lc=t, la=1.0, lw=3.0, label="")
        end
        # scatter!([hmc_trajs[1, end, t]], [hmc_trajs[2, end, t]], color=t, label="HMC $(t)")
        x = [x0[1], hmc_trajs[1, end, t]]
        y = [x0[2], hmc_trajs[2, end, t]]
        plot!(x,y, marker=:circle, arrow=true, arrowsize=1, lw=2, lc=t, mc=t)
    end
    
    # plot!(x,y, marker=:circle, arrow=true, arrowsize=0.5)
    hmc_plt
end

It is important to observe that the proposals travel far away from the starting location, yet still land in high-probability regions: exactly the two properties we want from a proposal distribution.

We have only illustrated the intuition behind the HMC method. Keen readers should consult references such as [3] for further details about Hamiltonian MC methods.

A comparison between the algorithms

In this section, we are going to compare the original Metropolis-Hastings (MH) with the Hamiltonian Monte Carlo sampler (HMC). The mixture density model is reused here.

Animation check

Firstly, we visually inspect the algorithms' sampling processes. Animations of the MH and HMC algorithms are presented below to help us gain some intuition about the algorithms. The MH algorithm uses a proposal variance $\sigma_q^2 = 0.1$; the HMC sampler uses a step size $\epsilon = 0.1$, with the number of leapfrog steps drawn uniformly between 10 and 50.

begin
    anim_mh_mixture = demo_mh_gif(ℓ;xlim= [-9,9], ylim=[-6,6], qΣ= 0.1* Matrix(I,2,2))
    anim_hmc_mixture = demo_hmc_gif(ℓ, ∇ℓ; xlim= [-9,9], ylim=[-6,6], ϵ= 0.1, Trange=[10, 50])
end;
gif(anim_mh_mixture; fps= 10 )
gif(anim_hmc_mixture; fps=10)

It can be observed that HMC performs significantly better than the original MH sampler

Chain inspection

We can use Julia's MCMCChains.jl to carry out standard chain diagnosis.

traceplot visualisation

We first inspect the chains visually using traceplot. The MH trace plots show high temporal correlations between the samples (which suggests the chain has not mixed well), while the HMC traces appear to mix well.

begin
    Random.seed!(100)
    spl_mh_mix, _, _, _ = metropolis_hastings(ℓ, 4000; Σ = 0.1 * Matrix(I, 2, 2))
    chain_mh_mix = Chains(spl_mh_mix[:, 2001:end]', [:θ₁, :θ₂])
    spl_hmc_mix, _, _, _ = hmc_sampling(ℓ, ∇ℓ, 4000, [0, 0]; ϵ=0.1, Τrange =[10,50])
    chain_hmc_mix = Chains(spl_hmc_mix[:, 2001:end]', [:θ₁, :θ₂])
end;

MH sampler's trace plot: note the high temporal correlations between the iterations

traceplot(chain_mh_mix)

HMC sampler's trace plot: both dimensions have mixed well

traceplot(chain_hmc_mix)

ess and efficiency

Efficiency metrics can also be computed and compared between the two algorithms. Out of 2000 iterations (after the first 2000 are discarded as burn-in), HMC produces around 1183 effective samples, while the original MH chain contains fewer than 17. HMC is therefore roughly 70-fold more efficient than the ordinary MH algorithm.

begin
    ess_stats = [mean(summarystats(chain_mh_mix)[:, :ess]), mean(summarystats(chain_hmc_mix)[:, :ess])]
    efficienty_stats = ess_stats/2000
    df = DataFrame(methods = ["MH", "HMC"], ess=ess_stats, efficiency=efficienty_stats)
end
methods      ess   efficiency
     MH  16.1911   0.00809556
    HMC  1183.27   0.591637

Conclusion

In this section, we have introduced MCMC, a class of practical computational methods for Bayesian inference. MCMC generates Monte Carlo samples from a target distribution without requiring the unknown normalising constant. All inference questions can then be answered by Monte Carlo estimation, which remains scalable even when the parameter space is high-dimensional.

However, sampling efficiently from a general high-dimensional distribution is not trivial. Traditional MCMC algorithms, such as Metropolis-Hastings and Gibbs sampling, suffer from random-walk behaviour. More advanced algorithms, such as the Hamiltonian sampler, which employs gradient information of the target distribution, usually perform better.

Implementing MCMC algorithms by oneself is clearly not ideal. Luckily, as an end user, one rarely needs to implement an MCMC algorithm. In the next chapter, we will see how probabilistic programming software such as Turing.jl simplifies the process: one usually only needs to specify a Bayesian model and leave the MCMC sampling task to the software. Nevertheless, the user still needs to carry out practical convergence diagnostics to check the quality of the samples, which is arguably the most important takeaway for an applied Bayesian inference user.

Some important concepts/terminologies about MCMC are summarised below:

Notes

1. This is a contrived question to illustrate the idea of the MH algorithm. We do not need to run any algorithm to generate samples from a deterministic die: the samples should be all threes.

2. It depends on how $\tfrac{0}{0}$ is defined. We have assumed it is zero or NaN here.

3. Neal, Radford M. "MCMC using Hamiltonian dynamics." Handbook of Markov Chain Monte Carlo 2.11 (2011): 2.

Appendix

Julia code used for this chapter

demo_mh_gif

begin
"""
    demo_mh_gif(ℓπ; qΣ, mc, x₀, xlim, ylim)

Produce a gif that demonstrates how the Metropolis-Hastings algorithm works

### Input

- `ℓπ` -- log pdf of the target distribution
- `qΣ` -- proposal distribution's variance, should be `dim` × `dim` symmetric and P.D. matrix
- `mc`  -- number of iterations to simulate
- `x₀` -- the initial starting point
- `xlim, ylim` -- horizontal and vertical limits of the plot

### Output

- `anim` -- an array of figures 

### Examples

```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
anim = demo_mh_gif(ℓπ; qΣ = 0.005 * Matrix(I,2,2))
gif(anim, fps= 10)
```
"""
    function demo_mh_gif(ℓπ; qΣ = 1.0 * Matrix(I,2,2), mc = 200, x₀= zeros(2), xlim=[-4, 4], ylim = [-4, 4])
        Random.seed!(100)
        chain, proposed, acpt, rate = metropolis_hastings(ℓπ, mc+1; Σ = qΣ, x0 = x₀)
        anim = produce_anim(ℓπ, chain, proposed, acpt, rate; mc = mc, xlim=xlim, ylim = ylim)
        return anim
    end


    
end
begin
"""
    demo_hmc_gif(ℓπ, ∇ℓπ; ϵ = 0.05, mc = 200, x₀= zeros(2), Trange=[100,200], xlim=[-4, 4], ylim = [-4, 4])

Produce a gif that demonstrates how the Hamiltonian Monte Carlo (HMC) algorithm works

### Input

- `ℓπ` -- log pdf of the target distribution
- `∇ℓπ` -- gradient of the log pdf of the target distribution
- `ϵ` -- step size of the HMC's Leap Frog simulation
- `mc`  -- number of iterations to simulate
- `x₀` -- the initial starting point
- `Trange` -- a range of possible steps of the Leap Frog algorithm
- `xlim, ylim` -- horizontal and vertical limits of the plot

### Output

- `anim` -- an array of figures 

### Examples

```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
∇ℓπ = (x) -> ForwardDiff.gradient(ℓπ, x)
anim = demo_hmc_gif(ℓπ, ∇ℓπ)
gif(anim, fps= 10)
```
"""
    function demo_hmc_gif(ℓπ, ∇ℓπ; ϵ = 0.05, mc = 200, x₀= zeros(2), Trange=[100,200], xlim=[-4, 4], ylim = [-4, 4])
        Random.seed!(100)
        chain, proposed, acpt, rate = hmc_sampling(ℓπ, ∇ℓπ, mc+1, x₀; ϵ = ϵ, Τrange =Trange)
        anim = produce_anim(ℓπ, chain, proposed, acpt, rate; mc = mc, xlim=xlim, ylim = ylim, title="HMC demonstration", with_accpt_rate=false)
        return anim
    end


    
end
begin
"""
    produce_anim(ℓπ, chain, proposed, acpt, rate; mc = 200, xlim=[-4, 4], ylim = [-4, 4], title="MH demonstration; ")

Produce an animation that demonstrates how a Metropolis-Hastings-like algorithm works

### Input

- `ℓπ` -- log pdf of the target distribution
- `chain` -- the mcmc trace, should be a `dim` × `mc+1` matrix (including initial state)
- `proposed` -- proposal history
- `acpt` -- acceptance history, array of booleans
- `rate` -- acceptance rate
- `mc`  -- number of iterations to show
- `xlim, ylim` -- horizontal and vertical limits of the plot

### Output

- `anim` -- an array of figures 

### Examples

```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
chain, proposed, acpt, rate = metropolis_hastings(ℓπ, 201)
anim = produce_anim(ℓπ, chain, proposed, acpt, rate)
gif(anim, fps= 10)
```
"""
    function produce_anim(ℓπ, chain, proposed, acpt, rate; mc = 200, xlim=[-4, 4], ylim = [-4, 4], title="MH demonstration; ", with_accpt_rate=true)
        if with_accpt_rate
            title = title * "acceptance rate=$(rate)"
        end
        plt_ = plot(range(xlim[1], xlim[2], length= 100), range(ylim[1], ylim[2], length= 100), (x, y)-> exp(ℓπ([x,y])), legend = :none, st=:contour, linewidth=0.5, la=0.5, levels=5, title=title)

        anim = @animate for i  in 1:mc
            scatter!(plt_, (chain[1, i], chain[2, i]),
                 label=false, mc=:green, ma=0.5)
            if !acpt[i]
                scatter!(plt_, (proposed[1, i], proposed[2, i]), label=false, mc =:red, ma=0.4)
                plot!([chain[1, i], proposed[1, i]], [chain[2, i], proposed[2, i]], st=:path, lc=:red, linestyle=:dot, la=0.5, label=false)
            end
            plot!(plt_, chain[1, i:i+1], chain[2, i:i+1], st=:path, lc=:green, la=0.5, label=false)
        end
        
        return anim
    end

end

metropolis_hastings

begin

"""
    metropolis_hastings(ℓπ, mc=500; dim=2, Σ = 10. * Matrix(I, dim, dim), x0=zeros(dim))

Sample a target probability distribution by the Metropolis-Hastings algorithm with a random-walk Gaussian proposal

### Input

- `ℓπ` -- log pdf of the target distribution
- `mc`   -- number of MC samples to simulate
- `dim`  -- dimension of the target distribution
- `Σ` -- proposal distribution's variance, should be `dim` × `dim` symmetric and P.D. matrix
- `x0` -- the initial starting point

### Output

- `samples` -- the `mc` iterations of samples, a `dim` × `mc` array
- `proposed` -- the history of the proposals, only for visualisation purpose
- `acceptance` -- whether the proposals being accepted or not
- `accpt_rate` -- acceptance rate of the chain

### Examples

```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
metropolis_hastings(ℓπ, 500; Σ = 1.0 * Matrix(I, 2, 2))
```
"""
    function metropolis_hastings(ℓπ, mc=500; dim=2, Σ = 10. * Matrix(I, dim, dim), x0=zeros(dim))
        samples = zeros(dim, mc)
        proposed = zeros(dim, mc-1)
        acceptance = Array{Bool, 1}(undef, mc-1)
        fill!(acceptance, false)
        samples[:, 1] = x0
        L = cholesky(Σ).L
        ℓx0 = ℓπ(x0) 
        accepted = 0
        for i in 2:mc
            # xstar = rand(MvNormal(x0, Σ))
            xstar = x0 + L * randn(dim)
            proposed[:, i-1] = xstar
            ℓxstar = ℓπ(xstar)
            r = exp(ℓxstar - ℓx0) 
            # if accepted
            if rand() < r
                x0 = xstar
                ℓx0 = ℓxstar
                accepted += 1
                acceptance[i-1] = true
            end
            @inbounds samples[:, i] = x0
        end
        accpt_rate = accepted / (mc-1)
        return samples, proposed, acceptance, accpt_rate
    end

end

gibbs_sampling

begin

"""
    gibbs_sampling(mc=500; μ= zeros(2), ρ=0.9, σ₁² = 1.0, σ₂²=1.0, x0=zeros(2))

Sample a bivariate Gaussian with mean ``\\mu``, marginal variances ``\\sigma_1^2, \\sigma_2^2`` and correlation ``\\rho``, i.e. variance-covariance ``\\begin{bmatrix} \\sigma_1^2 & \\rho\\sigma_1\\sigma_2 \\\\ \\rho\\sigma_1\\sigma_2 & \\sigma_2^2 \\end{bmatrix}``, by Gibbs sampling. 


### Note
This only works for a bivariate Gaussian target distribution; it is not a general sampling algorithm

### Examples
gibbs\\_xs = gibbs\\_sampling(4000)

chain\\_gibbs = Chains(gibbs_xs')
"""
function gibbs_sampling(mc=500; μ= zeros(2), ρ=0.9, σ₁² = 1.0, σ₂²=1.0, x0=zeros(2))
    samples = zeros(2, mc)
    x₁, x₂ = x0[1], x0[2]
    σ₁, σ₂ = sqrt(σ₁²), sqrt(σ₂²)
    # conditional distribution parameters of a bivariate Gaussian:
    # x₁ | x₂ ~ Normal(μ₁ + ρ(σ₁/σ₂)(x₂ - μ₂), σ₁√(1-ρ²)), and symmetrically for x₂ | x₁
    k₁ = ρ * σ₁ / σ₂
    k₂ = ρ * σ₂ / σ₁
    σ₁₂ = σ₁ * sqrt(1 - ρ^2)
    σ₂₁ = σ₂ * sqrt(1 - ρ^2)
    @inbounds samples[:, 1] = x0
    for i in 2:mc
        if i % 2 == 0
            x₁ = rand(Normal(μ[1] + k₁ * (x₂ - μ[2]), σ₁₂))
        else
            x₂ = rand(Normal(μ[2] + k₂ * (x₁ - μ[1]), σ₂₁))	
        end
        @inbounds samples[:, i] = [x₁, x₂]
    end
    return samples
end
end

hmc_sampling

begin

"""
    hmc_sampling(ℓπ, ∇ℓπ, mc, x₀; ϵ, Τrange =[100,200])

Sample a target probability distribution by Hamiltonian Monte Carlo sampler (HMC)

### Input

- `ℓπ` -- log pdf of the target distribution
- `∇ℓ`   -- gradient of the log pdf
- `mc`  -- the number of samples to draw
- `x₀` -- the initial starting point
- `ϵ` -- the leapfrog step size
- `Trange` -- the min and max number of leapfrog steps

### Output

- `samples` -- the `mc` iterations of samples, a `dim` × `mc` array
- `accpt_rate` -- acceptance rate of the HMC

### Examples

```julia
ℓπ = (x) -> logpdf(MvNormal([0,0], [1.0 0.9; 0.9 1.0]), x)
∇ℓπ = (x) -> ForwardDiff.gradient(ℓπ, x)
hmc_sampling(ℓπ, ∇ℓπ, 1000, randn(2);  ϵ= 0.05, Τrange = [100,200])
```
"""
    function hmc_sampling(ℓ, ∇ℓ, mc, x₀; ϵ, Τrange =[100,200])
        dim = length(x₀)
        samples = zeros(dim, mc)
        proposed = zeros(dim, mc-1)
        acceptance = Array{Bool, 1}(undef, mc-1)
        fill!(acceptance, false)
        samples[:, 1] = x₀
        g₀ = -1 * ∇ℓ(x₀)
        E₀ = -1 * ℓ(x₀)
        n_accept = 0 
        for i in 2:mc
            # initial momentum is Normal(0, 1)
            p = randn(dim)
            # evaluate old H(x₀, p) = E(x₀) + K(p)
            # where E(x₀) = - ℓ(x₀)
            H = (p' * p)/2 + E₀

            xnew, gnew = x₀, g₀
            # make Τ leapfrog simulation steps
            Τ = rand(Τrange[1]:Τrange[2])
            # optional
            ϵ = rand([-1, 1]) * ϵ
            for τ in 1:Τ
                # make half-step in p
                p = p - ϵ * gnew /2
                # make step in x
                xnew = xnew + ϵ * p
                # find new gradient
                gnew = -1 * ∇ℓ(xnew)
                # make half-step in p
                p = p - ϵ * gnew /2
            end
            #  find new value of E and then H
            proposed[:, i-1] = xnew 
            Enew = -1 * ℓ(xnew)
            Hnew = (p' * p)/2 + Enew
            dH = Hnew - H

            if dH < 0 
                accept = true
            elseif rand() < exp(-dH)
                accept = true
            else 
                accept = false
            end
        
            if accept
                x₀, g₀, E₀ = xnew, gnew, Enew
                n_accept = n_accept + 1
            end
            samples[:, i] = x₀
            acceptance[i-1] = accept
        end
        return samples, proposed, acceptance, n_accept/(mc-1)
    end
end
begin

    function hamiltonian_proposal_step(ℓπ, ∇ℓπ, x₀, ϵ, T)
        dim = length(x₀)
        g₀ = -1 * ∇ℓπ(x₀)
        E₀ = -1 * ℓπ(x₀)
        # initial momentum is Normal(0, 1)
        p = randn(dim)
        # evaluate old H(x₀, p) = E(x₀) + K(p)
        # where E(x₀) = - ℓπ(x₀)
        H = (p' * p)/2 + E₀
        xnew, ∇new = x₀, g₀
        xs = zeros(dim, T+1)
        xs[:, 1] = xnew
        # make Τ leapfrog simulation steps
        # Τ = rand(Τrange[1]:Τrange[2])
        # optional
        # ϵ = rand([-1, 1]) * ϵ
        for τ in 1:T
            # make half-step in p
            p = p - ϵ * ∇new /2
            # make a step in x based on the speed p
            xnew = xnew + ϵ * p
            # find new gradient
            ∇new = -1 * ∇ℓπ(xnew)
            # make half-step in p
            p = p - ϵ * ∇new /2
            xs[:, τ+1] = xnew 
        end
        return xs
    end


end

Others

Code for referencing equations

js(x) = HypertextLiteral.JavaScript(x)
"""
`texeq(code::String)`
Take an input string and render it inside an equation environment (numbered) using KaTeX
Equations can be given labels by adding `"\\\\label{name}"` inside the `code` string and subsequently referenced in other cells using `eqref("name")`
### Note
Unfortunately backward slashes have to be doubled when creating the TeX code to be put inside the equation
When Pluto will support interpretation of string literal macros, this could be made into a macro
"""
function texeq(code,env="equation")
    code_escaped = code 			|>
    x -> replace(x,"\\" => "\\\\")	|>
    x -> replace(x,"\n" => " ")
    @htl """
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.13.11/dist/katex.min.css" integrity="sha384-Um5gpz1odJg5Z4HAmzPtgZKdTBHZdw8S29IecapCSB31ligYPhHQZMIlWLYQGVoc" crossorigin="anonymous">
    <script src="https://cdn.jsdelivr.net/npm/katex@0.13.11/dist/katex.min.js" integrity="sha384-YNHdsYkH6gMx9y3mRkmcJ2mFUjTd0qNQQvY9VYZgQd7DcN7env35GzlmFaZ23JGp" crossorigin="anonymous"></script>
    
    <script>
    katex.render('\\\\begin{$(js(env))} $(js(code_escaped)) \\\\end{$(js(env))}',currentScript.parentElement,{
                    displayMode: true,
                    trust: context => [
                        '\\\\htmlId', 
                        '\\\\href'
                    ].includes(context.command),
                    macros: {
                      "\\\\label": "\\\\htmlId{#1}{}"
                    },
                })
    </script>
    """
end
"""
`eqref(label::String)`
Function that create an hyperlink pointing to a previously defined labelled equation using `texeq()`
"""
eqref(label) = @htl """
<a eq_id="$label" id="eqref_$label" href="#$label" class="eq_href">(?)</a>
"""
@htl """
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.13.11/dist/katex.min.css" integrity="sha384-Um5gpz1odJg5Z4HAmzPtgZKdTBHZdw8S29IecapCSB31ligYPhHQZMIlWLYQGVoc" crossorigin="anonymous">
<style>
a.eq_href {
    text-decoration: none;
}
</style>
        <script src="https://cdn.jsdelivr.net/npm/katex@0.13.11/dist/katex.min.js" integrity="sha384-YNHdsYkH6gMx9y3mRkmcJ2mFUjTd0qNQQvY9VYZgQd7DcN7env35GzlmFaZ23JGp" crossorigin="anonymous"></script>
<script id="katex-eqnum-script">
const a_vec = [] // This will hold the list of a tags with custom click, used for cleaning listeners up upon invalidation
const eqrefClick = (e) => {
    e.preventDefault() // This prevent normal scrolling to link
    const a = e.target
    const eq_id = a.getAttribute('eq_id')
    window.location.hash = 'eqref-' + eq_id // This is to be able to use the back function to resume previous view, 'eqref-' is added in front to avoid the viewport actually going to the equation without having control of the scroll
    const eq = document.getElementById(eq_id)
    eq.scrollIntoView({
        behavior: 'smooth',
        block: 'center',
    })
}
const checkCounter = (item, i) => {
    return item.classList.contains('enclosing')	?
    i											:
    i + 1
}
const updateCallback = () => {
a_vec.splice(0,a_vec.length) // Reset the array
const eqs = document.querySelectorAll('span.enclosing, span.eqn-num')
let i = 0;
eqs.forEach(item => {
    i = checkCounter(item,i)
    console.log('item',i,'=',item)
    if (item.classList.contains('enclosing')) {
        const id = item.id
        const a_vals = document.querySelectorAll(`[eq_id=\${id}]`)
        a_vals !== null && a_vals.forEach(a => {
            a_vec.push(a) // Add this to the vector
            a.innerText = `(\${i+1})`
            a.addEventListener('click',eqrefClick)
        })
    }
})
}
const notebook = document.querySelector("pluto-notebook")
// We have a mutationobserver for each cell:
const observers = {
    current: [],
}
const createCellObservers = () => {
    observers.current.forEach((o) => o.disconnect())
    observers.current = Array.from(notebook.querySelectorAll("pluto-cell")).map(el => {
        const o = new MutationObserver(updateCallback)
        o.observe(el, {attributeFilter: ["class"]})
        return o
    })
}
createCellObservers()
// And one for the notebook's child list, which updates our cell observers:
const notebookObserver = new MutationObserver(() => {
    updateCallback()
    createCellObservers()
})
notebookObserver.observe(notebook, {childList: true})
invalidation.then(() => {
    notebookObserver.disconnect()
    observers.current.forEach((o) => o.disconnect())
    a_vec.forEach(a => a.removeEventListener('click',eqrefClick))
})
</script>
"""

Foldable details

begin
    struct Foldable{C}
        title::String
        content::C
    end

    function Base.show(io, mime::MIME"text/html", fld::Foldable)
        write(io, "<details><summary>$(fld.title)</summary><p>")
        show(io, mime, fld.content)
        write(io, "</p></details>")
    end
end
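The struct can then wrap any content with an HTML `show` method; the title and body below are placeholder text:

```julia
using Markdown  # provides the md"" string macro

# Renders as a collapsible <details> block in the notebook:
Foldable("Optional: derivation details", md"The *hidden* content goes here.")
```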